DOCUMENT RESUME

10 1<l7 3€1 
TB 006 BIS

TITLE           Guidelines and Cautions for Considering
                Criterion-Referenced Testing.
INSTITUTION     National Education Association, Washington, D.C.
PUB DATE        Aug 75
NOTE            21p.; Also available as ED 106 233
AVAILABLE FROM  National Education Association, 1201 16th Street,
                N.W., Washington, D.C. 20036 (Price not available)
EDRS PRICE      MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS     *Achievement Tests; *Criterion Referenced Tests;
                *Diagnostic Tests; *Educational Objectives;
                *Guidelines; Norm Referenced Tests; Standardized
                Tests; Standards; *Teacher Role; Test Bias; Test
                Construction
IDENTIFIERS     Domain Referenced Tests; *Objective Referenced
                Tests



ABSTRACT

The opinions presented reflect those of the National
Education Association Task Force on Testing: that norm referenced
tests are often misused, and that problems are also associated with
criterion referenced tests. A list of warnings for teachers
considering the use of criterion referenced tests includes: (1)
common deficiencies in testing need to be communicated both to the
profession and to the public, (2) teachers should have an extensive
role, from the beginning, in determining objectives, (3) the claims
of criterion, objective, and domain referenced tests should be viewed
with some skepticism, but with an open mind, (4) teachers should
obtain information on field testing, reliability, and validity of the
tests they use, (5) teachers should vigorously resist the misuse of
all kinds of tests, and (6) teachers should not allow themselves to
be evaluated according to the results of any test. (Author/CIB)



Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.



GUIDELINES AND CAUTIONS FOR CONSIDERING
CRITERION-REFERENCED TESTING




Published by the

NATIONAL EDUCATION ASSOCIATION
1201 16th Street, N.W., Washington, D.C. 20036



August 1975 



PREFACE 




When the NEA Task Force on Testing began to conclude that the great
preponderance of standardized tests likely do more harm than good, they
initiated an exploration of alternatives. In doing so, they recognized the
potential of criterion-referenced testing for alleviating some of the
destructive effects of standardized measures and engaged in an examination
of the concept.

The Task Force's conclusions about the usefulness of criterion-referenced
tests (CRT's) resulted in mixed reviews. While they were aware that teachers
have long used CRT's in some form (the Friday quiz based on the instructional
objectives for the week is one example), they hoped to identify more refined
adaptations. Increased psychometric sophistication was indeed found, but
frequently it was without correction of the major deficiencies in standardized
tests. As a result of their deliberations and findings, the Task Force gave
direction to the development of the guidelines and cautions that follow. The
15 caveats present a strong point of view on criterion-referenced testing, one
to which teacher associations will want to give serious attention where
criterion-referenced tests are proposed or already in use.

For those who need to brush up a bit on some commonly used definitions
related to tests, a glossary of measurement terms is appended.



Bernard H. McKenna
Professional Associate

NEA Instruction and Professional Development



GUIDELINES AND CAUTIONS
FOR CONSIDERING CRITERION-REFERENCED TESTING

Standardized achievement tests used in most schools today are known
as "norm-referenced" tests. They are constructed in such a way as to maximize
differences among students so that one can be compared to another. This is
done by providing for maximum discrimination between high and low scores. The
purpose is to rank a student among his peers. Hence, scores are reported in
such terms as "John Jones is in the ninety-fifth percentile on verbal reasoning."
While norm-referenced tests are useful for sorting people into categories (to
the dismay of many), they are not useful for improving educational programs.

Recently a new concept has been promoted among test makers and the
educational public called criterion-referenced testing, also termed
objective-referenced testing. At least three factors have contributed to
emergence of the concept: First is a strong and rising dissatisfaction with
tests in general. Second is the inadequacy of traditional tests for diagnostic
and instructional purposes. Third, there is some clamor for evaluating
instruction and teachers as part of the accountability movement. Although
criterion or objective-referenced tests may have potential for diagnosing
learning problems and improving instruction, they are not useful for evaluating
teachers. (Test scores depend largely on variables in a student's background
rather than on what he or she is taught in the classroom. Even so, a few years
ago a bill was introduced in the Kansas legislature to cut off funds to
districts whose children did not score above the national average on such
tests. Fortunately the bill did not pass.)

Criterion-referenced tests, instead of comparing one child to another,
presumably measure the child's performance against a specified criterion (or
objective). Thus all children might be able to achieve the criterion and
eventually score 100 percent on the tests. The criterion-referenced test, in
concept, is much like the kind of test the teacher gives in the classroom on
Friday to evaluate learning of specific objectives taught earlier in the week.
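
The difference between the two readings of a score can be sketched in a few lines of Python. This is only an illustration under assumed values: the class scores, the criterion of 8 items correct, and the function names are all hypothetical and do not come from any actual test.

```python
from bisect import bisect_left


def percentile_rank(score, peer_scores):
    """Norm-referenced reading: percent of peers scoring strictly below `score`."""
    ordered = sorted(peer_scores)
    below = bisect_left(ordered, score)  # count of peers with a lower score
    return 100.0 * below / len(ordered)


def meets_criterion(score, criterion):
    """Criterion-referenced reading: compare the score to a fixed standard only."""
    return score >= criterion


# Hypothetical raw scores (items correct out of 10) for one class.
class_scores = [4, 5, 6, 6, 7, 8, 8, 9, 9, 10]
CRITERION = 8  # hypothetical standard, chosen only for illustration

ranks = [percentile_rank(s, class_scores) for s in class_scores]
mastery = [meets_criterion(s, CRITERION) for s in class_scores]

# If every child scored 10, all would meet the criterion, yet the
# norm-referenced ranks would no longer separate anyone.
all_perfect = [10] * 10
everyone_masters = all(meets_criterion(s, CRITERION) for s in all_perfect)
```

The percentile rank of any one child changes whenever the peers' scores change, while the criterion judgment depends only on that child's own score and the chosen standard.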

Conceivably the external criterion toward which the test is directed
could be a number of things. For example, one could have a criterion-referenced
test for measuring the skills of a bricklayer (no doubt a nonverbal test)
asking "Can he lay bricks? Can he mix mortar?" without reference to how others
do. The higher an individual scored on the test, the closer he would be to
acquiring a bricklayer's skills, regardless of how many other people had the
same skills.

Test makers, however, have shown little inclination to develop tests
directed toward such criteria. Establishing a sequence of skills and validating
them is a laborious, difficult, multiyear task at best. Staying with the example
of the bricklayer, they would have to conduct studies to show that good
bricklayers score high on the test; that is, they would have to evaluate the
test. Test makers instead have resorted to a conception of criterion-referenced
tests as those which yield measurements "directly interpretable in terms of
specified performance standards" (Glaser and Nitko, 1971). In practice, this
means that the criterion toward which the test is directed is usually a
prespecified objective (an objective stated in advance, e.g., "A bricklayer
must be able to mix mortar").

Thus "criterion-referenced" usually means in practice "objective-referenced."
In fact, those who have most strongly propagated criterion-referenced testing
are frequently the same persons who have propagated behavioral objectives. In
a typical procedure, objectives are established and test items are written to
measure those objectives. Test results can be reported in terms of what specific
objectives each individual student was able to achieve, which presumably is
useful for instructional purposes. In this way, it is argued, tests can be
tailored to specific objectives the way a teacher tailors test questions on
what he or she has taught.

The distinction between criterion-referenced and norm-referenced tests is
quite blurred. Most test makers use similar procedures to construct items for
both types, or use the same items, and employ test statistics for norm-referenced
items in selecting items for criterion-referenced tests. There are no clearly
defined and commonly agreed upon procedures for constructing criterion-referenced
tests, and many of them are in fact norm-referenced tests in disguise. The
distinction becomes a matter of emphasis rather than being clear-cut.

Womer (1973) defines a criterion-referenced test as

    ...one which is designed to provide information about
    attainment of a specific objective (criterion), which
    emphasizes direct measurement through the use of differing
    formats, which may use items at varying difficulty
    levels, which must have content validity, which must
    minimize guessing, and which is particularly useful for
    instructional and evaluative purposes.

Womer's "differing formats" term indicates he is keen on test items
which call for responses other than multiple-choice. Many criterion-referenced
tests continue to be made up mainly of multiple-choice items.

A main advantage claimed for criterion-referenced tests is their
utility for improving educational programs. In view of the confusion among
test makers themselves about the concept, construction, and utility of the tests,
some caveats are in order for those considering the use of criterion-referenced
or objective-referenced tests.

1. Common deficiencies in testing need to be communicated both to the
profession and to the public. Neither criterion-referenced tests (CRT's)
nor objective-referenced tests (ORT's) eliminate the most common deficiencies
of tests in general.

CRT's and ORT's for the most part still measure simple tasks at the
expense of relearning abilities and higher-level thought processes (Stake, 1973).
Complex performances are so difficult to measure that test items reflect only
the simpler tasks. Such things as Binet's categories of mental imagery,
imagination, aesthetic appreciation, and moral sensibility are almost totally
unmeasured.

2. Teachers should examine carefully the derivation of the objectives for ORT's.

ORT's can be no better than the objectives on which they are based.
Unfortunately, the methods for deriving objectives are often ill-considered,
hasty, and grossly inadequate. There is an inclination among test makers to
slide over the problems of deriving objectives in order to get to item
construction, a task with which they are more familiar. Yet appropriate
objectives are just as important and just as difficult to arrive at as are
test items.

There are at least four ways to choose objectives (Klein and Kosecoff, 1973).
First, choosing by expert judgment means that a small group of subject matter
experts decides which objectives should be measured for a given field. This was
essentially the origin of National Assessment tests. While few persons would
deny the relevance of the judgments of subject matter experts, few would contend
that such judgments faithfully or completely represent what should be taught. By
no means do they fully represent the judgments of teachers, parents, students,
and others vitally concerned.

A second way of choosing objectives is by consensus judgment, which requires
that various groups (teachers, administrators, parents, school board, etc.)
decide what objectives are most important. (In this paper "objectives" refers
to specific student learning outcomes.) Unfortunately, the immense problems of
such prioritizing have been slighted. Frequently decision-making groups respond
only to those objectives that are presented to them by a single group (e.g.,
school administrators) or a limited number of groups. Correcting for important
objectives that have been omitted is not taken into account. If critical
objectives do not emerge from the objective-generating process they are
ordinarily lost forever. For example, there is likely to emerge a high
preponderance of content-bound objectives that are easily measurable. More
subtle learnings are neglected. Attending to the objectives that are easily
identifiable severely limits the range of decision-makers' thinking and results
in determining (and limiting) the curriculum.

The rating of priority statements themselves is severely dependent upon
how abstractly the objectives are specified (how global they are), the types
of criteria on which the objectives are rated (Are they rated in "importance,"
how much money will be spent on them, how much time and effort will be spent?),
and the nature of the groups doing the rating (Stake and Gooler, 1973; House,
1973). Test makers have had little experience polling the opinions of
nonprofessional groups, so surveys for the purpose of developing or rating the
importance of objectives are likely to be highly class-biased. Actually, such
surveys are seldom done. Objectives generation and measurement are likely to
be treated in the most cavalier fashion. Test developers who would never think
of including an item without field testing it sometimes accept and discard
objectives with abandon. A common procedure is to have the objectives reviewed
by a small group of citizens and educators and claim that the objectives have
been approved by the public. Those citizens involved are too frequently
upper-middle-class and the educators so selected that they are not broadly
representative.

A third way of deriving objectives is through curriculum analysis. One can
inspect materials such as textbooks or courses of study to determine what is being
taught and then write objectives and test items based on such content. Much of
the impetus for CRT's came from curriculum developers like those who pioneered
Individually Prescribed Instruction (IPI) as part of their efforts to develop
tests that measure exactly what the materials teach. This procedure also has
its limitations in that it is likely to emphasize only content-related objectives.

Fourth, objectives can be chosen by in-depth analysis of those instructional
areas which one wishes to test. One tries to determine the contents and behaviors
in an area of instruction and to associate objectives and test items with contents
and behaviors. In other words, by task analysis the instruction is broken into
discrete learnings. The most ambitious efforts along this line have resulted in
instruments called "domain-referenced tests" (Hively, 1973; Baker, 1973;
Millman, 1973).

Domain-referenced testing (DRT) attempts to define "domains" of behavior
(categories of behavior one might test and teach for) and to represent these
domains by an extensive pool of test items which measure human performance in
a particular domain or domains. In one sense, domain-referenced tests appear
to be an attempt to escape the triviality and absurdity of much of the behavioral
objectives movement. If one must delineate a highly specific objective for each
aspect of student behavior, one might generate thousands of such objectives. In
one project an attempt to define a complete set of objectives for the high school
was given up after 20,000 objectives had been written. A complete delineation
becomes an absurdity and most such lists become trivial.

Domain-referenced testing aims at overcoming these problems by defining
important categories of content and behavior so that only objectives representing
particular domains become important. Other objectives are merely subsets or
examples. The instructional benefits of such a scheme promise to be large since
one could practice on other objectives and test items from the domain to learn
the behavior. One could always construct another test from the innumerable
objectives and test items representing that domain.

DRT's exist more in promise than in practice. No doubt the task analysts
will confront the same formidable conceptual problems as psychologists who have
tried to categorize mental behavior and curriculum developers who have tried to
define the "structure" of their subject. Even the most sophisticated schemes
of human mental abilities, such as Bloom's Taxonomy, tend to falter when
subjected to empirical examination. Human mental processes defy categorization,
which suggests emphasis on the long-debated principle of teaching to the whole
child rather than to specific skills.

3. Teachers should have an extensive role, from the beginning, in deriving
objectives and should beware of co-optation.

Most teacher (and public) involvement in developing objectives has been
cursory at best, more for the purpose of legitimizing the objectives than for
determining or implementing them. For example, objective-referenced tests
were developed for the state assessment program in Michigan and employed on
a mandatory basis at selected grade levels. For the selected grades, subject
specialists from the state education agency set up a small committee of educators,
including four teachers, to select and review objectives. The committees developed
goals which were later reviewed by subject matter associations. Then several
one-day large group meetings were held around the state to give people a chance
to respond.

Despite this effort to involve them, many of the teachers and administrators
who participated in the group meetings felt that they had not had adequate input
on the objectives (House, Rivers, and Stufflebeam, 1974). They were presented
with a list of objectives and asked to respond after a cursory review. Most
teachers in the state never saw or heard of the objectives. In spite of promises
that the objectives were only for experimental purposes, the state agency developed
tests based on them and administered them the following year, claiming educator
endorsement.

4. Which objectives are selected and retained for testing is critical for
ORT's. Teachers should be intimately involved from the beginning in selecting
objectives.

Selection of final objectives for testing is as important as generating
them, and teachers are frequently provided only cursory participation in this
activity also. In the Michigan assessment program over four hundred objectives
were generated for fourth-grade mathematics, yet only thirty-five were selected
for testing. The limiting factor was the amount of time required for testing
each objective (it was deemed advisable not to exceed five hours of testing
time). Which objectives were excluded? Why? If only the most important
objectives were included, how was "importance" determined? What would be the
instructional effect over time of excluding the other several hundred objectives?

In most cases of objective development the objectives are rewritten and screened
by state education agency officials, select citizens' groups, and test makers.
For example, in Illinois, goals derived from public hearings were selected and
extensively rewritten by several groups before being presented as public goals.

5. The ways in which test items are constructed should be examined. When
possible, teachers should employ their own test experts to help them assess the
procedures.

The usual number of items to measure one objective seems to vary from
three to five. Good results have been obtained with five. Since even the most
specific objective can be measured by thousands of test items, selection is
important. Sophisticated test makers use a systematic sampling plan that produces
items for subcategories of the objectives.

Of at least equal importance is the type of response the item calls for.
Traditional tests use multiple-choice answers because they are easy to
machine-score. However, if the purpose of the test is to describe and diagnose
classroom learning and provide usable information to the teacher, multiple-choice
answers may be much less desirable. The degree to which a test is a faithful
sample of learning behavior is more important in an objective-referenced test
than in one which merely strives to differentiate among students.

A group of items constructed by teachers is likely to be more relevant 
to the instruction of those particular teachers. Items written by measurement 
experts from a matrix of content and behavior are likely to be technically
better but less relevant. 

6. CRT's and ORT's should be thoroughly field tested. Teachers should refuse
to use tests that have not been thoroughly field tested.

While this may seem a rather obvious caveat, the fact is that many
objective-referenced tests have not been extensively tried out. Even where
tried out, frequently only a handful of students are involved. Tests with so little
field testing should be resolutely avoided. The test developer should be required
to present details of the field test. If he can't, he probably hasn't conducted
one, an all too common occurrence.

7. Test developers should present evidence of the test's reliability. Teachers
should not use tests for which evidence on reliability is unavailable.

For an ORT, each set of items used to measure an objective might be considered
a test in itself. These should be reliable measures in and of themselves. The
usual reliability determinants are test statistics which are measures of internal
consistency developed for traditional norm-referenced tests. They are based
on variations in individual test scores: item difficulty and the differences
between the top scorers as opposed to bottom scorers, for example. The reliability
will be highest when about half of the students get an item right and half get it
wrong, a norm-referenced concept maximizing discrimination among test takers.
Using these traditional techniques causes the tests to discriminate in the
same way as do items in standardized tests. Unfortunately, the ORT developers
have not been able to solve this problem. The alternative is to have no evidence
of reliability, which to many is even more unacceptable. Perhaps the best policy
is to insist on some measures of reliability, ones for which the developers
supply a public rationale which can be assessed.
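
The dependence of traditional reliability statistics on score variation can be illustrated with a small Python sketch. The paper does not name a particular statistic; Kuder-Richardson formula 20 (KR-20) is used here only as one common internal-consistency measure, and the response matrix below is hypothetical.

```python
def item_difficulty(responses, item):
    """Proportion of students answering `item` correctly (the item p-value)."""
    return sum(row[item] for row in responses) / len(responses)


def item_variance(p):
    """Variance of a right/wrong item; it peaks when p is 0.5."""
    return p * (1.0 - p)


def kr20(responses):
    """KR-20 internal consistency for 0/1 item scores.

    Returns None when the statistic is undefined (a single item, or no
    variation at all in total scores).
    """
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n_students
    var_total = sum((t - mean) ** 2 for t in totals) / n_students
    if k == 1 or var_total == 0:
        return None
    pq = sum(item_variance(item_difficulty(responses, i)) for i in range(k))
    return (k / (k - 1)) * (1.0 - pq / var_total)


# Hypothetical data: 6 students by 5 items, 1 = correct, 0 = incorrect.
spread_out = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
everyone_masters = [[1, 1, 1, 1, 1] for _ in range(6)]

reliability_spread = kr20(spread_out)
reliability_mastery = kr20(everyone_masters)  # None: no score variation
```

When every student reaches the objective there is no variation left for such a statistic to work with, which is one way of seeing why statistics built to discriminate among students fit awkwardly with tests on which all students might succeed.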

8. The test makers should present evidence of the validity of the tests.
Teachers should inspect the validation procedures carefully.

Validity, which depends upon the ability to answer the question "Does
the test measure what it is supposed to?", presents another difficult problem
for the maker of criterion-referenced tests. For traditional norm-referenced
tests, validity is often established by how well the test predicts concurrent
academic grades. But this makes little sense for CRT's. Test developers are
usually left trying to make logical assessments of "content validity" based on
how the tests were developed.

If the test is objective-referenced, one can assess whether test items
adequately measure the objectives and whether the objectives themselves are
valid for what the test is trying to measure.

If the test purports to measure the effects of classroom instruction, then
the objectives must be the ones taught and the test items must be sensitive to
instruction. The Michigan assessment program tried a "sensitivity index"
to determine if correctly responding to an item was dependent on instruction.
The index didn't work in this situation. A highly specific objective might be
valid for one class but not for another, and a test which presumes to be valid
for assessing instruction in a whole state has the problem of demonstrating
that its items and objectives were constructed in such a way as to be appropriate
statewide, not an easy task. The whole problem of validity is an unresolved
one, but the burden of proof should fall on the test maker, not the buyer.



No matter what the derivation of the test or what it is called, unless
it covers what a particular teacher has taught it cannot be a valid measure for
that teaching situation; it is a measure of someone else's objectives. On the
other hand, if the test is a measure of objectives which the teacher developed
or which he is willing to accept as indicative of his instruction, then the
objectives are valid for that teaching situation.

9. "Minimal competency" or "mastery" cut-off points for students should be
viewed with some suspicion. Teachers should question arbitrary standards and
substitute their own.

Item difficulty on tests can be manipulated easily by test makers. Whether
a student scores 30 percent or 88 percent can be built into the test itself
and just as easily changed by assigning arbitrary values to test items. Since
there is no objective means by which tests can establish a level of satisfactory
"competency," the setting of such standards is extremely arbitrary. What is
minimal competency in reading? When has one "mastered" reading? On the other
hand, one may be willing to accept the opinions of certain groups as standards
if they are clearly recognized as group opinion and subject to all the
deficiencies that implies.

Nonetheless, many CRT developers continue to build highly arbitrary
standards into their tests. For example, the Michigan assessment is based on
a minimal skill concept that declares a student must achieve 75 percent of the
minimal objectives. In the first year of implementation some of the districts
where the highest academic achievement might be expected were able to achieve
only 30 percent of some objectives. The 75 percent cut-off was evidently
without justification.
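
How strongly the reported "mastery" rate depends on where the cut-off is placed can be shown with a short Python sketch. The scores and the three cut-offs below are hypothetical and chosen only for illustration; they are not Michigan data.

```python
def mastery_rate(percent_scores, cutoff):
    """Fraction of students whose percent-correct score is at or above `cutoff`."""
    passed = sum(1 for s in percent_scores if s >= cutoff)
    return passed / len(percent_scores)


# Hypothetical percent-correct scores for ten students on one test.
scores = [30, 45, 60, 70, 74, 75, 80, 88, 90, 95]

# The same performances yield very different "mastery" rates
# depending only on the cut-off chosen.
rates = {cutoff: mastery_rate(scores, cutoff) for cutoff in (60, 75, 90)}
```

Nothing about the students changes between the three results; only the standard does, which is why the rationale for any particular cut-off deserves scrutiny.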



10. Many objective-referenced tests are really norm-referenced tests in disguise.
No teacher should voluntarily administer a test that he does not understand.

If one constructs objectives such as "reading a newspaper at a
fourth-grade level," the norm is obviously built in. If one then selects test
items using traditional test statistics, like item difficulty, and uses items from
norm-referenced tests, the result is a test that discriminates among students
but has the appearance of being referenced to skills rather than students. It
becomes a norm-referenced test that looks like a criterion-referenced test.
(Some test experts claim that it is impossible to construct anything other than
a norm-referenced test.) It is also possible to use ORT results in a
norm-referenced manner if one counts how many objectives each student learned and
then makes comparisons among students.

11. The public and the profession should be made aware that CRT's or ORT's are
not panaceas. Bias problems remain the same with CRT's or ORT's as with
norm-referenced tests.

Lower-socioeconomic groups will score as low on criterion or
objective-referenced tests as they do on norm-referenced tests. Basic factors
such as malnutrition and lack of motivation toward school and test taking are
untouched by change from one type to another. What CRT's might offer some
students is a reprieve from being told they are inferior. (In some districts test
scores are attached to the report cards or even reported in the newspapers.)
Since self-confidence seems to be critical in schooling, lack of stigmatization
could be an important advantage. Another advantage might be to spell out in
greater detail where certain educational weaknesses of students lie. Actually,
CRT developers have done little that might result in preventing racial, class,
school building, or neighborhood bias in their tests.

12. CRT's could cost more than traditional tests, depending on the thoroughness
of development. The costs of tests versus their utility should be carefully
considered.

Traditional norm-referenced tests already exist and do not need to be
developed, so if CRT superiority can't be positively demonstrated, the question
should be raised, "Why go to the extra time and expense?" Also, because of
their greater specificity, consider that CRT's might be valid for only a small
domain of behavior at a given point in time (there could be large rewards in
this, of course, in promoting learning). Many more tests would have to be
developed rather than a few general ones. The procedure of developing and
validating objectives and test items is a long, difficult, and costly
procedure when properly done.

There are two ways of reducing costs. One is based on the assumption that
there are certain basic and necessary skills and stages of learning independent
of the local setting and that one need develop only one test for basic reading
skills and sell it to everyone. This is the assumption of the test makers,
but it is a questionable one. Learning often seems to be highly context-dependent.
Children learn in different ways in different settings. The inability of
educational research to come up with guaranteed teaching techniques and the
inability of psychology to demonstrate transfer of training indicate this is so.

Another way of reducing costs would be to have local groups of teachers
develop their own CRT's as they now do for their classrooms. But there is the
question of whether the amount of time required would be profitably spent in test
construction. (See chapter II, "Cooperative Development of Evaluation Systems
for Student Learning," in Bloom, Hastings, and Madaus, 1971.)

13. Teachers should not be evaluated on CRT's and ORT's any more than on
norm-referenced tests. Teachers should not allow themselves to be evaluated on
the basis of ANY tests.

Tests are not good measures of what is taught in school. Although
objective-referenced tests purport to be better measures of learning, they cannot
be considered good measures of teaching. An obvious deficiency is that the tests
measure only cognitive aspects of the classroom. In addition, the teacher does
not have control over many of the variables that affect test scores. Evaluating
teachers is a use that should not be claimed for ORT's. The evaluation of teaching
should be based on observation, self-evaluation, student ratings, interviews,
and many other types of data.

14. A main advantage of CRT's or ORT's seems to be in the reporting of results,
that is, avoiding blanket categorizations of children by test scores and
providing more useful instructional information. Subtests should be used only
as diagnostic instruments.

Instead of a composite score with which the teacher can do little but
type the child, in criterion or objective-referenced testing the teacher is
presented with specific objectives the student can or cannot accomplish. The
avoidance of a single score categorizing the child is a major benefit.
Presumably the teacher also will be better able to make use of the detailed
objectives for improving instruction and learning.
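
The contrast between a composite score and a per-objective report can be sketched in Python. Everything here is a hypothetical illustration: the two objectives, the three items per objective, the rule that two of three items correct counts as accomplishing an objective, and the function names.

```python
def composite_score(item_results):
    """Single percent-correct score across all items."""
    return 100.0 * sum(item_results.values()) / len(item_results)


def objective_report(item_results, item_to_objective, items_needed):
    """Map each objective to True/False: were enough of its items answered correctly?"""
    counts = {}
    for item, is_correct in item_results.items():
        objective = item_to_objective[item]
        counts[objective] = counts.get(objective, 0) + (1 if is_correct else 0)
    return {objective: n >= items_needed for objective, n in counts.items()}


# Hypothetical test: three items for each of two objectives.
item_to_objective = {
    "f1": "fractions", "f2": "fractions", "f3": "fractions",
    "m1": "measurement", "m2": "measurement", "m3": "measurement",
}

student_a = {"f1": True, "f2": True, "f3": True,
             "m1": False, "m2": False, "m3": False}
student_b = {"f1": True, "f2": False, "f3": False,
             "m1": True, "m2": True, "m3": False}

# Both students receive the same composite score...
composite_a = composite_score(student_a)
composite_b = composite_score(student_b)

# ...but the per-objective reports point to different instructional needs.
report_a = objective_report(student_a, item_to_objective, items_needed=2)
report_b = objective_report(student_b, item_to_objective, items_needed=2)
```

Two children with identical composite scores can need entirely different instruction, which the single score conceals and the objective-level report makes visible.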

It should be noted, however, that there is little evidence that a teacher
can do a better job working with specific objectives than working without them.
Whether to use specific objectives should remain a matter of style and judgment
for the individual teacher. Stake (1973) has indicated that there are significant
costs in using behavioral objectives, including the possibility that the teacher
will teach only what is easy to measure. In Michigan, most teachers did not find
the ORT's valuable for instructional purposes (House, Rivers, and Stufflebeam,
1974). The instructional benefits are also reduced by the limited number of
objectives to which one can teach and for which one can reasonably test.

15. While worthy of consideration, the claims of criterion, objective, and
domain-referenced tests should be viewed with some skepticism but with an open
mind. Teachers should vigorously resist the misuse of all kinds of tests.
In some ways CRT's can be viewed as a response by the testing establishment
to avoid some of the criticisms of tests. Such was the motivation in
Michigan. CRT's and ORT's still embody most of the deficiencies of tests in
general and are not useful for evaluating teachers in accountability schemes.
These tests are also difficult to construct and are subject to much conceptual
confusion, even though they do offer the potential of being more useful for
instruction.

An important benefit of CR versus norm-referenced tests is that with CRT's the
test taker is not stigmatized by a global score supposedly representing his/her
ability. This is a great advantage. The best use of tests is in raising
questions in the teacher's mind about individual students who achieve unusual
scores. The tests themselves may be in error, or the teacher's preconception
may be. In any case, following up on seeming discrepancies is the job of the
professional. Tests should be used to raise questions, not to resolve them.






REFERENCES

Baker, Eva L. "Beyond Objectives: Domain-Referenced Tests for Evaluation
and Instructional Improvement." Educational Technology, 1973.

Bloom, Benjamin S.; Hastings, J. Thomas; and Madaus, George F. Handbook
on Formative and Summative Evaluation of Student Learning. New York:
McGraw-Hill Book Co., 1971.

Glaser, Robert, and Nitko, Anthony J. "Measurement in Learning and
Instruction." Educational Measurement. (Edited by Robert L. Thorndike.)
Washington, D.C.: American Council on Education, 1971. pp. 625-70.

Hively, Wells. "Domain-Referenced Testing." Educational Technology,
1973.


House, Ernest R. "Validating a Goal Priority Instrument." Paper presented
at the annual meeting of American Educational Research Association, New
Orleans, February 25-March 1, 1973.

House, Ernest; Rivers, Wendell; and Stufflebeam, Dan. An Assessment of
the Michigan Accountability System. Michigan Education Association
and National Education Association, March 1974.

Klein, Stephen P., and Kosecoff, Jacqueline. Issues and Procedures in
the Development of Criterion-Referenced Tests. Princeton, N.J.:
ERIC Clearinghouse on Tests, Measurement, and Evaluation, September 1973.

Millman, Jason. "How To Make Assessment Plans for Domain-Referenced
Tests." Educational Technology, 1973.

Popham, W. James, and Husek, T. R. "Implications of Criterion-Referenced
Measurement." Journal of Educational Measurement, 1969.

Stake, Robert E. "Measuring What Learners Learn." School Evaluation.
(Edited by Ernest R. House.) Berkeley, Calif.: McCutchan Publishing
Corp., 1973.

Stake, Robert E., and Gooler, Dennis. "Measuring Goal Priorities." School
Evaluation. (Edited by Ernest R. House.) Berkeley, Calif.: McCutchan
Publishing Corp., 1973.

Womer, Frank B. "What is Criterion-Referenced Measurement?" IRA,
Committee on the Evaluation of Reading Tests.




GLOSSARY OF MEASUREMENT TERMS*



ACHIEVEMENT TEST

A test that measures the amount learned by a student, usually in
academic subject matter or basic skills.



APTITUDE TEST

A test consisting of items selected and standardized so that the
test yields a score that can be used in predicting a person's
future performance on tasks not evidently similar to those in the
test. Aptitude tests may or may not differ in content from achievement
tests, but they do differ in purpose. Aptitude tests consist
of items that predict future learning or performance; achievement
tests consist of items that sample the adequacy of past learning.



CRITERION

A standard or judgment used as a basis for quantitative and qualitative
comparison; that variable to which a test is compared to
constitute a measure of the test's validity. For example, grade-point
average and attainment of curricular objectives are often used
as criteria for judging the validity of an academic aptitude test.

CRITERION-REFERENCED TEST

A test in which every item is directly identified with an explicitly
stated educational behavioral objective. The test is designed to
determine which of these objectives have been mastered by the examinee.
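
The contrast between an objective-by-objective report and a single composite score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names, the two objectives, and the two-thirds mastery cutoff are all hypothetical and do not come from the glossary or from any particular test.

```python
from collections import defaultdict

# Hypothetical item-to-objective map: each item is tied to one explicit objective.
ITEM_OBJECTIVE = {
    "q1": "adds two-digit numbers",
    "q2": "adds two-digit numbers",
    "q3": "adds two-digit numbers",
    "q4": "reads a bar graph",
    "q5": "reads a bar graph",
    "q6": "reads a bar graph",
}

# Assumed cutoff for "mastery"; the glossary does not specify one.
MASTERY_CUTOFF = 2 / 3


def mastery_report(responses):
    """Return {objective: True/False} from {item: 1 or 0} responses."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, objective in ITEM_OBJECTIVE.items():
        total[objective] += 1
        correct[objective] += responses.get(item, 0)
    return {obj: correct[obj] / total[obj] >= MASTERY_CUTOFF for obj in total}


# One made-up student: all addition items right, one of three graph items right.
student = {"q1": 1, "q2": 1, "q3": 1, "q4": 0, "q5": 1, "q6": 0}
report = mastery_report(student)      # per-objective mastery decisions
composite = sum(student.values())     # the single score a composite report gives
```

For this made-up student the composite is 4 of 6, while the report separates one mastered objective from one that is not mastered. The choice of cutoff is itself a judgment that the sketch does not settle.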

GRADE NORM

The average test score obtained by students classified at a given
grade placement.



LOCAL NORMS

Norms that have been obtained from data collected in a limited locale,
such as a school system, county, or state. They may be used instead
of national norms to evaluate student performance.



MULTIPLE-CHOICE ITEM

A test question consisting of a stem in the form of a direct question
or incomplete statement and two or more answers, called alternatives
or response choices. The examinee's task is to choose from among the
alternatives provided the best answer to the question posed in the stem.




* Excerpts from the revised edition of A Glossary of Measurement Terms: A Basic
Vocabulary for Evaluation and Testing, published by CTB/McGraw-Hill, Del Monte
Research Park, Monterey, California. Copyright © 1973. Reprinted by permission.




NONVERBAL TEST

A test in which the items consist of symbols, figures, numbers,
or pictures, but not words.



PERFORMANCE TEST

A test that requires the use and manipulation of physical objects
and the application of physical and manual skills. Shorthand or
typing tests, in which the response called for is similar to the
behavior about which information is desired, exemplify work-sample
tests, which are a type of performance test.

RANDOM SAMPLE

A sample drawn in such a way that every member of the population
has an equal chance of being included, thus eliminating selection
bias. A random sample is "representative" of its total population.
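
As a minimal sketch of what "equal chance of being included" means in practice, the Python standard library can draw such a sample without replacement. The roster of 200 students, the sample size of 20, and the seed are hypothetical choices made only for illustration.

```python
import random


def draw_random_sample(population, n, seed=None):
    """Draw n distinct members so that each member is equally likely to be chosen."""
    rng = random.Random(seed)          # a seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement


# Hypothetical roster of 200 students.
students = [f"student_{i:03d}" for i in range(1, 201)]
sample = draw_random_sample(students, 20, seed=1975)
```

Any single draw may still differ from the population by chance; the procedure removes selection bias, and the sample is representative only in that sense.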




RELIABILITY 

The consistency of test scores obtained by the same individuals
on different occasions or with different sets of equivalent items;
accuracy of scores. Several types of reliability coefficients
should be distinguished.

Coefficient of internal consistency is a measure based on internal
analysis of data obtained on a single trial of a test (Kuder-Richardson
formulas and the split-half method using the Spearman-Brown formula).
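
The two approaches named above can be sketched as follows. This is an illustration only, assuming right/wrong (1/0) item scores, the Kuder-Richardson formula 20 with population variance, and an odd-even split of the items; the glossary does not say which Kuder-Richardson formula or which split is intended, and the score matrix is made up.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def kr20(scores):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores (one row per examinee)."""
    k = len(scores[0])                        # number of items
    totals = [sum(row) for row in scores]     # total score per examinee
    pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in scores)    # proportion answering item j correctly
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / pvariance(totals))


def split_half(scores):
    """Odd-even split-half reliability, stepped up with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)


# Made-up data: six examinees, four items, from a single administration.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
kr20_value = kr20(scores)
split_half_value = split_half(scores)
```

The two coefficients need not agree, since the split-half value depends on how the items are divided. Both come from one sitting of the test, which is what distinguishes them from the coefficients of equivalence and stability.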

Coefficient of equivalence or alternate-form reliability refers to
a correlation between scores from two forms of a test given at
approximately the same time.

Coefficient of stability or test-retest reliability refers to a
correlation between test and retest with some period of time
intervening. The test-retest situation may be with two forms of the same
test.
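
Both the coefficient of equivalence and the coefficient of stability are correlations between two sets of scores for the same examinees. A minimal sketch, assuming the usual Pearson product-moment correlation (the glossary does not name a specific formula) and using made-up scores for five students:

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between paired scores for the same examinees."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores: the same five students tested twice, some weeks apart.
first_testing = [12, 15, 9, 20, 17]
retest = [13, 14, 10, 19, 18]

stability = pearson_r(first_testing, retest)
```

The same function would give a coefficient of equivalence if the two lists held scores on two forms given at about the same time. What differs is the design that produced the scores, not the arithmetic.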




STANDARDIZED TEST

A test constructed of items that are appropriate in difficulty and
discriminating power for the intended examinees and that fit the
preplanned table of content specifications. The test is administered in
accordance with explicit directions for uniform administration and is
used with a manual that contains reliable norms for the defined
reference groups.



VALIDITY

The ability of a test to measure what it purports to measure. Many
methods are used to establish validity, depending on the test's
purpose.



